Constants and library imports¶

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import hdbscan
import shap
from bokeh.plotting import curdoc

import lightgbm as lgb
import catboost as cb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from functools import partial

from utils.distribution import get_df_info, DistributionPlotter
from utils.reductors import make_tsne, make_umap
from utils.drplotter import DimReductionPlotter
from utils.lgbm import plot_feature_info, plot_scores, plot_tree_info

sns.set_style("dark")
plt.style.use("dark_background")
pio.templates.default = "plotly_dark"
curdoc().theme = "dark_minimal"
shap.initjs()
In [3]:
WEBSTAT_DATASET_PATH = "./datasets/t1_webstat.csv"
TRAIN_DATASET_PATH = "./datasets/train.csv"
TEST_DATASET_PATH = "./datasets/test.csv"
SUBMISSION_PATH = "./datasets/submission.csv"
SAMPLE_SUBMISSION_PATH = "./datasets/sample_submission.csv"

Dataset analysis¶

Webstat¶

Initial inspection¶

Let's sort right away by sessionkey_id and date_time

In [4]:
web = pd.read_csv(WEBSTAT_DATASET_PATH)
web["date_time"] = pd.to_datetime(web["date_time"])
web = web.sort_values(["sessionkey_id", "date_time"])
web.head()
Out[4]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale
2268917 109996122 1975-10-17 13:42:56.953 2 1 11.0 722.0 NaN NaN NaN NaN
2268918 109996122 1975-10-17 13:43:07.510 2 2 22.0 7196.0 NaN NaN NaN NaN
2268919 109996122 1975-10-17 13:43:29.860 2 3 25.0 779.0 NaN NaN NaN NaN
2269206 109996122 1975-10-17 13:43:54.757 2 4 9.0 7196.0 NaN NaN NaN NaN
2267445 109996122 1975-10-17 13:44:03.803 2 5 11.0 723.0 NaN NaN NaN NaN
In [5]:
get_df_info(web)
Out[5]:
dtype nunique nan zero empty string example(-s) mode, mode proportion trash_score
product_in_sale float64 2 n: 0.633 NaN NaN (1.0, nan) (1.0, 1.0) 1.000
good_id float64 233144 n: 0.633 NaN NaN (57794032.0, 66632395.0) (66921494.0, 0.001) 0.633
price float64 12299 n: 0.633 NaN NaN (59.0, 12481.0) (952.0, 0.004) 0.633
model_id float64 181760 n: 0.613 NaN NaN (19237096.0, 1734006.0) (18340251.0, 0.002) 0.613
category_id float64 3549 n: 0.294 NaN NaN (4012.0, 4553.0) (155.0, 0.054) 0.294
pageview_duration_sec float64 2975 n: 0.088 z: 0.006 NaN (-13608.0, -6658.0) (9.0, 0.025) 0.094
sessionkey_id int64 328430 NaN NaN NaN (113210921, 117494105) (119635649.0, 0.0) NaN
date_time datetime64[ns] 3329535 NaN NaN NaN (1975-12-24 18:26:33.407000, 1975-12-17 23:03:... (1976-01-25 22:35:55.557000, 0.0) NaN
page_type int64 13 NaN NaN NaN (11, 8) (1.0, 0.387) NaN
pageview_number int64 632 NaN NaN NaN (329, 248) (1.0, 0.097) NaN

A lot of NaNs, which is unpleasant :/

A look at the NaNs in the last columns¶
In [6]:
for column in ("category_id", "model_id", "good_id", "price", "product_in_sale"):
    print(column)
    print(web[~web[column].isna()]["page_type"].value_counts())
    print()
category_id
page_type
1    1289570
2     932848
4     133137
Name: count, dtype: int64

model_id
page_type
1    1289578
Name: count, dtype: int64

good_id
page_type
1    1225243
Name: count, dtype: int64

price
page_type
1    1225243
Name: count, dtype: int64

product_in_sale
page_type
1    1225243
Name: count, dtype: int64

Interesting: we can conclude that page_type = 1 is the product page!
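A quick cross-check of that conclusion (a sketch on toy data, not the actual dataset): the share of non-null good_id per page_type should be close to 1 only for the product page type.

```python
import pandas as pd

# Toy frame: only page_type 1 carries a good_id.
demo = pd.DataFrame({
    "page_type": [1, 1, 2, 3],
    "good_id": [10.0, 11.0, None, None],
})

# Fraction of non-null good_id per page_type.
share = demo["good_id"].notna().groupby(demo["page_type"]).mean()
print(share.to_dict())  # {1: 1.0, 2: 0.0, 3: 0.0}
```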

In [7]:
web[(web["page_type"] == 1)]
Out[7]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale
2268628 110019268 1975-10-17 15:27:58.257 1 2 43.0 206.0 8748965.0 22312252.0 2986.0 1.0
2268629 110020180 1975-10-17 15:29:52.147 1 1 NaN 147.0 1513237.0 55614318.0 4490.0 1.0
2269208 110040418 1975-10-17 17:05:41.530 1 1 25.0 1200.0 1827718.0 10547740.0 726.0 1.0
2268920 110040418 1975-10-17 17:06:06.163 1 2 43.0 1200.0 1827718.0 10547740.0 726.0 1.0
2267447 110040418 1975-10-17 17:07:45.243 1 6 55.0 1200.0 14122715.0 28114543.0 430.0 1.0
... ... ... ... ... ... ... ... ... ... ...
2267442 134628743 1976-02-16 20:56:31.220 1 2 4.0 127.0 19246197.0 NaN NaN NaN
2074763 134628743 1976-02-16 20:56:54.337 1 4 66.0 127.0 17200183.0 NaN NaN NaN
2267443 134628743 1976-02-16 20:58:12.070 1 6 9.0 127.0 9401923.0 NaN NaN NaN
2267444 134628743 1976-02-16 20:58:26.223 1 8 3.0 127.0 17200183.0 NaN NaN NaN
2011581 134629277 1976-02-16 20:58:06.137 1 1 16.0 NaN NaN NaN NaN NaN

1291547 rows × 10 columns

But even here there are still NaNs :(

Plotting the distributions¶
In [8]:
web_plotter = DistributionPlotter(web)
web_plotter.plot_all()
web_plotter.show_plot()
Strange data 0_0¶

The pageview_duration_sec field is odd: it contains NaNs and negative values...

Let's see how that happened!

In [9]:
web[(web["pageview_duration_sec"] < 0)]
Out[9]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale
2270348 110328896 1975-10-19 15:51:10.010 1 16 -1.0 2873.0 144660.0 65175298.0 1178.0 1.0
2272147 110422717 1975-10-20 00:37:21.590 1 12 -6.0 1241.0 16890898.0 62773803.0 732.0 1.0
2272150 110422717 1975-10-20 01:08:05.410 4 33 -15.0 1229.0 NaN NaN NaN NaN
2273210 110422717 1975-10-20 01:08:06.030 4 32 -1.0 5673.0 NaN NaN NaN NaN
2272429 110467977 1975-10-20 10:58:39.170 1 1 -1.0 1330.0 3563114.0 20279782.0 1099.0 1.0
... ... ... ... ... ... ... ... ... ... ...
2051829 134596933 1976-02-16 18:34:59.390 3 4 -7.0 NaN NaN NaN NaN NaN
2056596 134609922 1976-02-16 20:08:51.777 2 23 -9.0 7790.0 NaN NaN NaN NaN
2056601 134609922 1976-02-16 20:20:10.567 2 46 -6.0 7323.0 NaN NaN NaN NaN
2266220 134616502 1976-02-16 20:03:29.110 2 6 -1.0 201.0 NaN NaN NaN NaN
2059338 134621944 1976-02-16 20:29:41.960 1 5 -3.0 127.0 1799088.0 37969387.0 5331.0 1.0

2739 rows × 10 columns

In [10]:
web[(web["sessionkey_id"] == 110328896)]
Out[10]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale
2271131 110328896 1975-10-19 15:09:36.890 2 1 1.0 2873.0 NaN NaN NaN NaN
2270344 110328896 1975-10-19 15:09:37.010 2 2 204.0 1241.0 NaN NaN NaN NaN
2270345 110328896 1975-10-19 15:13:01.207 1 3 361.0 1241.0 27621026.0 65231588.0 360.0 1.0
2270346 110328896 1975-10-19 15:19:02.770 2 4 122.0 2446.0 NaN NaN NaN NaN
2269942 110328896 1975-10-19 15:21:04.087 2 5 292.0 1333.0 NaN NaN NaN NaN
2269943 110328896 1975-10-19 15:25:56.770 1 6 434.0 1333.0 9208426.0 64816713.0 405.0 1.0
2270347 110328896 1975-10-19 15:33:10.330 3 7 49.0 NaN NaN NaN NaN NaN
2270731 110328896 1975-10-19 15:33:59.310 2 8 NaN 1183.0 NaN NaN NaN NaN
2270732 110328896 1975-10-19 15:37:10.950 1 11 365.0 1241.0 29232485.0 63119872.0 252.0 1.0
2270733 110328896 1975-10-19 15:43:15.350 1 12 83.0 1241.0 27704320.0 60621585.0 334.0 1.0
2270734 110328896 1975-10-19 15:44:38.830 3 13 73.0 NaN NaN NaN NaN NaN
2269944 110328896 1975-10-19 15:45:51.810 5 14 94.0 NaN NaN NaN NaN NaN
2270735 110328896 1975-10-19 15:47:25.613 1 15 225.0 2873.0 209585.0 60429426.0 1178.0 1.0
2271132 110328896 1975-10-19 15:51:09.210 5 17 114.0 NaN NaN NaN NaN NaN
2270348 110328896 1975-10-19 15:51:10.010 1 16 -1.0 2873.0 144660.0 65175298.0 1178.0 1.0
2269945 110328896 1975-10-19 15:53:03.030 1 18 0.0 2873.0 6369236.0 63392163.0 1271.0 1.0
2271133 110328896 1975-10-19 15:53:03.570 5 19 34.0 NaN NaN NaN NaN NaN
2270736 110328896 1975-10-19 15:53:37.990 5 20 53.0 NaN NaN NaN NaN NaN
2271134 110328896 1975-10-19 15:54:30.190 2 21 18.0 2873.0 NaN NaN NaN NaN
2270737 110328896 1975-10-19 15:54:48.030 2 22 NaN 2873.0 NaN NaN NaN NaN

As you can see, the order within the session simply got mixed up (pageview 16 is logged after 17), which explains the negative duration
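One hedged way to repair such sessions (a sketch; it assumes pageview_number is trustworthy when present) is to re-sort within each session by pageview_number instead of the logged timestamps:

```python
import pandas as pd

def fix_session_order(df: pd.DataFrame) -> pd.DataFrame:
    # Within each session, trust pageview_number over date_time.
    return df.sort_values(["sessionkey_id", "pageview_number"], kind="stable")

# Tiny example: pageviews 16 and 17 arrive swapped.
demo = pd.DataFrame({
    "sessionkey_id": [1, 1, 1],
    "pageview_number": [15, 17, 16],
})
fixed = fix_session_order(demo)
print(fixed["pageview_number"].tolist())  # [15, 16, 17]
```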

In [11]:
web[(web["sessionkey_id"] == 133729636)]
Out[11]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale
2250161 133729636 1976-02-10 13:54:36.863 1 1 249.0 1200.0 136805.0 28904311.0 2264.0 1.0
2044139 133729636 1976-02-10 13:58:45.423 1 2 3229.0 1200.0 136805.0 28904311.0 2264.0 1.0
2033903 133729636 1976-02-10 13:59:12.187 1 3 579.0 1200.0 19566244.0 62771283.0 1892.0 1.0
2033904 133729636 1976-02-10 14:08:51.613 3 4 97.0 NaN NaN NaN NaN NaN
2033905 133729636 1976-02-10 14:18:04.980 1 8 1.0 5605.0 132912.0 19870269.0 1790.0 1.0
1909261 133729636 1976-02-10 14:18:05.097 1 9 29.0 1200.0 136805.0 28904311.0 2264.0 1.0
1909262 133729636 1976-02-10 14:18:34.330 3 10 36.0 NaN NaN NaN NaN NaN
2033906 133729636 1976-02-10 14:19:10.247 1 11 17.0 1200.0 136805.0 28904311.0 2264.0 1.0
1909263 133729636 1976-02-10 14:19:27.727 3 12 8.0 NaN NaN NaN NaN NaN
2250162 133729636 1976-02-10 14:45:35.320 1 1 -2810.0 1200.0 136805.0 28904311.0 2264.0 1.0
2250163 133729636 1976-02-10 14:51:14.417 1 2 80.0 1200.0 136805.0 28904311.0 2264.0 1.0
2250164 133729636 1976-02-10 14:52:34.117 1 3 -2623.0 1200.0 136805.0 28904311.0 2264.0 1.0
2044140 133729636 1976-02-10 14:55:36.570 3 4 -2708.0 NaN NaN NaN NaN NaN
2250165 133729636 1976-02-10 15:07:49.180 1 1 -4144.0 1200.0 136805.0 28904311.0 2264.0 1.0

Here the problem is different: three sessions are merged into one (pageview_number restarts from 1 twice), and they compute pageview_duration_sec from each other's date_time, which produces the large negative values
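A possible fix (a sketch on toy data; `split_merged_sessions` is a hypothetical helper) is to derive a sub-session id that increments whenever pageview_number drops back down within the same sessionkey_id:

```python
import pandas as pd

def split_merged_sessions(df: pd.DataFrame) -> pd.Series:
    # Increment a counter every time pageview_number decreases
    # relative to the previous row of the same session.
    restart = df.groupby("sessionkey_id")["pageview_number"].diff() < 0
    return restart.groupby(df["sessionkey_id"]).cumsum()

demo = pd.DataFrame({
    "sessionkey_id": [7, 7, 7, 7, 7],
    "pageview_number": [1, 2, 3, 1, 2],  # second visit starts at 1 again
})
print(split_merged_sessions(demo).tolist())  # [0, 0, 0, 1, 1]
```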

Train¶

Initial inspection¶
In [12]:
tr = pd.read_csv(TRAIN_DATASET_PATH, index_col="order_id")
tr["create_time"] = pd.to_datetime(tr["create_time"])
tr["model_create_time"] = pd.to_datetime(tr["model_create_time"])
tr.head()
Out[12]:
create_time good_id price utm_medium utm_source sessionkey_id category_id parent_id root_id model_id is_moderated rating_value rating_count description_length goods_qty pics_qty model_create_time is_callcenter
order_id
1269921 1975-12-26 09:30:08 9896348 753 5 8.0 123777004 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20 1
1270034 1975-12-26 10:28:57 9896348 753 1 2.0 123781654 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20 0
1268272 1975-12-25 11:24:28 9896348 753 2 3.0 123591002 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20 1
1270544 1975-12-26 14:16:06 9896348 753 1 1.0 123832302 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20 1
1270970 1975-12-26 18:21:47 9896348 753 3 56.0 123881603 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20 0
In [13]:
get_df_info(tr)
Out[13]:
dtype nunique nan zero empty string example(-s) mode, mode proportion trash_score
is_moderated int64 2 NaN z: 0.049 NaN (0, 1) (1.0, 0.951) 0.951
rating_value float64 11 n: 0.677 NaN NaN (6.0, 10.0) (5.0, 0.672) 0.677
is_callcenter int64 2 NaN z: 0.645 NaN (0, 1) (0.0, 0.645) 0.645
rating_count float64 30 n: 0.507 z: 0.136 NaN (35.0, 13.0) (1.0, 0.285) 0.642
description_length int64 3106 NaN z: 0.384 NaN (802, 43) (0.0, 0.384) 0.384
utm_source float64 289 n: 0.1 NaN NaN (6.0, 227.0) (1.0, 0.476) 0.100
model_create_time datetime64[ns] 31697 n: 0.01 NaN NaN (1974-12-22 19:30:29, 1974-05-15 21:33:11) (1975-02-10 17:16:18, 0.005) 0.010
pics_qty int64 34 NaN z: 0.005 NaN (19, 16) (1.0, 0.366) 0.005
create_time datetime64[ns] 102998 NaN NaN NaN (1976-01-03 11:35:20, 1976-01-01 13:30:59) (1976-01-20 10:49:10, 0.0) NaN
good_id int64 53691 NaN NaN NaN (59724690, 32240662) (66921494.0, 0.002) NaN
price int64 6362 NaN NaN NaN (2444, 152) (264.0, 0.009) NaN
utm_medium int64 8 NaN NaN NaN (1, 6) (1.0, 0.457) NaN
sessionkey_id int64 96803 NaN NaN NaN (121264750, 112943865) (125996889.0, 0.0) NaN
category_id int64 1733 NaN NaN NaN (7178, 3554) (155.0, 0.09) NaN
parent_id int64 368 NaN NaN NaN (1542, 7948) (154.0, 0.094) NaN
root_id int64 26 NaN NaN NaN (1481, 2303) (1183.0, 0.264) NaN
model_id int64 37299 NaN NaN NaN (18677008, 4142405) (18340251.0, 0.005) NaN
goods_qty int64 114 NaN NaN NaN (55, 8) (1.0, 0.319) NaN
Plotting the distributions¶
In [14]:
tr_plotter = DistributionPlotter(tr, hue_col="is_callcenter")
tr_plotter.plot_all()
tr_plotter.show_plot()

Test¶

Initial inspection¶
In [15]:
tst = pd.read_csv(TEST_DATASET_PATH, index_col="order_id")
tst["create_time"] = pd.to_datetime(tst["create_time"])
tst["model_create_time"] = pd.to_datetime(tst["model_create_time"])
tst.head()
Out[15]:
create_time good_id price utm_medium utm_source sessionkey_id category_id parent_id root_id model_id is_moderated rating_value rating_count description_length goods_qty pics_qty model_create_time
order_id
1350922 1976-02-05 15:08:37 9896348 1143 1 2.0 132744630 139 133 124 123517 1 5.0 6.0 1204 6 2 1971-04-14 00:15:20
1354989 1976-02-07 15:26:00 69445048 1707 1 1.0 133161905 136 133 124 123551 1 10.0 0.0 2010 26 3 1971-04-14 00:15:20
1352637 1976-02-06 11:43:58 70607886 576 1 1.0 132792626 136 133 124 123583 1 3.0 4.0 0 34 7 1971-04-14 00:15:20
1350050 1976-02-05 11:26:19 61918401 436 1 1.0 132683062 236 232 201 124228 1 4.0 1.0 0 2 4 1971-04-21 00:09:54
1341733 1976-02-01 19:36:32 37964900 573 6 4.0 131789790 138 133 124 123901 1 5.0 1.0 0 37 2 1971-04-16 10:52:08
In [16]:
get_df_info(tst)
Out[16]:
dtype nunique nan zero empty string example(-s) mode, mode proportion trash_score
is_moderated int64 2 NaN z: 0.059 NaN (0, 1) (1.0, 0.941) 0.941
rating_value float64 11 n: 0.7 NaN NaN (6.0, 10.0) (5.0, 0.712) 0.700
rating_count float64 31 n: 0.539 z: 0.13 NaN (35.0, 21.0) (1.0, 0.293) 0.669
description_length int64 2050 NaN z: 0.423 NaN (647, 1737) (0.0, 0.423) 0.423
utm_source float64 130 n: 0.09 NaN NaN (53.0, 11.0) (1.0, 0.439) 0.090
model_create_time datetime64[ns] 9443 n: 0.011 NaN NaN (1975-05-20 17:22:57, 1975-08-29 02:06:07) (1975-02-10 17:16:18, 0.006) 0.011
pics_qty int64 28 NaN z: 0.006 NaN (30, 47) (1.0, 0.381) 0.006
create_time datetime64[ns] 16934 NaN NaN NaN (1976-02-09 18:42:42, 1976-02-05 11:40:07) (1976-02-13 14:02:08, 0.0) NaN
good_id int64 12183 NaN NaN NaN (68707597, 31785151) (59028240.0, 0.003) NaN
price int64 3299 NaN NaN NaN (1312, 3965) (271.0, 0.007) NaN
utm_medium int64 8 NaN NaN NaN (6, 4) (1.0, 0.404) NaN
sessionkey_id int64 16019 NaN NaN NaN (132350004, 133939423) (132712616.0, 0.0) NaN
category_id int64 1071 NaN NaN NaN (3591, 4791) (155.0, 0.106) NaN
parent_id int64 296 NaN NaN NaN (7373, 7941) (154.0, 0.11) NaN
root_id int64 24 NaN NaN NaN (1481, 1478) (1183.0, 0.237) NaN
model_id int64 10239 NaN NaN NaN (8289546, 21430000) (18340251.0, 0.006) NaN
goods_qty int64 105 NaN NaN NaN (10, 103) (1.0, 0.273) NaN

Nothing unusual, everything matches train

Plotting the distributions¶
In [17]:
tst_plotter = DistributionPlotter(tst)
tst_plotter.plot_all()
tst_plotter.show_plot()
Brief conclusions on the test set¶

The is_moderated feature is distributed differently from train, so it is probably better not to use it. The remaining features are distributed roughly the same

Also, the sessionkey_id values in test start after those in train, so we can assume we are forecasting future data

In [18]:
tr["create_time"].max() < tst["create_time"].min()
Out[18]:
True

That is indeed the case, so for validation we will split the orders by time
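A minimal sketch of such a time-based split (toy data; the column name matches create_time in this notebook). It holds out the most recent rows instead of shuffling:

```python
import pandas as pd

def time_split(df: pd.DataFrame, time_col: str = "create_time", frac: float = 0.75):
    # Sort by time, keep the earliest `frac` of rows for training
    # and the latest rows for validation, mimicking the train/test boundary.
    df_sorted = df.sort_values(time_col)
    n_train = int(len(df_sorted) * frac)
    return df_sorted.iloc[:n_train], df_sorted.iloc[n_train:]

demo = pd.DataFrame({"create_time": pd.to_datetime(
    ["1975-12-01", "1975-12-10", "1975-12-20", "1975-12-30"])})
train_part, val_part = time_split(demo, frac=0.5)
print(len(train_part), len(val_part))  # 2 2
```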

EDA¶

Working with sessions¶

The first thing to do is aggregate the per-session data

In [17]:
web.columns  # So we don't forget :)
Out[17]:
Index(['sessionkey_id', 'date_time', 'page_type', 'pageview_number',
       'pageview_duration_sec', 'category_id', 'model_id', 'good_id', 'price',
       'product_in_sale'],
      dtype='object')
In [18]:
(web["product_in_sale"].isna() == web["good_id"].isna()).all()
Out[18]:
True

product_in_sale is a useless column :/ (its NaN mask matches good_id exactly)

In [19]:
agg_params = {
    "session_length": ("sessionkey_id", lambda x: x.shape[0]),
    #
    "session_datetime_start": ("date_time", lambda x: x.iloc[0]),
    "session_datetime_end": ("date_time", lambda x: x.iloc[-1]),
    #
    "last_page_type": ("page_type", lambda x: x.iloc[-1]),
    **{
        f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
        for i in (3, 6)
    },
    #
    # **{
    #     f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
    #     for i in range(1, 13 + 1)
    # },  # Types 3 and 6 turned out the most useful; to save time only they are computed
    #
    #
    "last_pageview_number": ("pageview_number", lambda x: x.max()),
    #
    "pageview_duration_sec_last": ("pageview_duration_sec", lambda x: x.iloc[-1]),
    "pageview_duration_sec_sum": ("pageview_duration_sec", lambda x: np.nansum(x)),
    "pageview_duration_sec_min": ("pageview_duration_sec", lambda x: x.min()),
    "pageview_duration_sec_max": ("pageview_duration_sec", lambda x: x.max()),
    #
    "categories": ("category_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "models": ("model_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "goods": ("good_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "price_min": ("price", lambda x: x.min()),
    "price_max": ("price", lambda x: x.max()),
}

web_aggregate = web.groupby("sessionkey_id", sort=False).agg(**agg_params)
In [20]:
web_aggregate["datetime_diff"] = (
    web_aggregate["session_datetime_end"] - web_aggregate["session_datetime_start"]
).dt.total_seconds()
web_aggregate["timedelta_1"] = (
    web_aggregate["datetime_diff"] - web_aggregate["pageview_duration_sec_sum"]
)

for i in (3, 6):
    web_aggregate[f"page_type_{i}_proportion"] = (
        web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
    )

# for i in range(1, 13 + 1):
#     web_aggregate[f"page_type_{i}_proportion"] = (
#         web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
#     )

web_aggregate.sample(5)
Out[20]:
session_length session_datetime_start session_datetime_end last_page_type page_type_3 page_type_6 last_pageview_number pageview_duration_sec_last pageview_duration_sec_sum pageview_duration_sec_min pageview_duration_sec_max categories models goods price_min price_max datetime_diff timedelta_1 page_type_3_proportion page_type_6_proportion
sessionkey_id
131876093 3 1976-02-01 17:34:26.507 1976-02-01 17:36:39.717 6 1 1 6 NaN 33.0 8.0 25.0 {4449} {27730843} {60396422} 256.0 256.0 133.210 100.210 0.333333 0.333333
125297077 12 1976-01-03 18:11:41.850 1976-01-03 18:27:26.823 3 2 1 15 NaN 648.0 13.0 102.0 {1701} {12164171, 23209991, 16750511} {42056988, 42056958, 42056959} 353.0 494.0 944.973 296.973 0.166667 0.083333
125782858 1 1976-01-06 11:39:00.443 1976-01-06 11:39:00.443 1 0 0 1 NaN 0.0 NaN NaN {1214} {3745181} {29541343} 2212.0 2212.0 0.000 0.000 0.000000 0.000000
113213995 6 1975-11-02 15:40:19.243 1975-11-02 15:53:13.903 3 2 0 10 NaN 502.0 11.0 410.0 {257, 2873} {209585, 17044329} {60391505, 59847471} 1115.0 1271.0 774.660 272.660 0.333333 0.000000
125681519 3 1976-01-05 19:44:44.897 1976-01-05 19:46:24.053 1 0 0 3 NaN 100.0 32.0 68.0 {1200} {1531418} {30456057} 2204.0 2204.0 99.156 -0.844 0.000000 0.000000

Creating the training set¶

In [21]:
def transform(data: pd.DataFrame, web_aggregate: pd.DataFrame):
    data_transformed = data.join(web_aggregate, "sessionkey_id")
    data_transformed["timedelta_2"] = (
        data_transformed["create_time"] - data_transformed["session_datetime_start"]
    ).dt.total_seconds()
    data_transformed["timedelta_3"] = (
        data_transformed["session_datetime_end"] - data_transformed["create_time"]
    ).dt.total_seconds()

    X = data_transformed.drop(
        columns=[
            "create_time",
            "model_create_time",
            "session_datetime_start",
            "session_datetime_end",
            "sessionkey_id",
            "categories",
            "models",
            "goods",
            "is_moderated",
        ]
    )

    if "is_callcenter" in data_transformed.columns:
        return X.drop(columns=["is_callcenter"]), X.is_callcenter.values

    return X
In [23]:
X, y = transform(tr, web_aggregate)
In [30]:
sns.boxplot(y=X["page_type_3"], hue=y, showfliers=False)
plt.show()
In [31]:
sns.boxplot(y=X["page_type_6"], hue=y, showfliers=False)
plt.show()
In [32]:
sns.boxplot(y=X["timedelta_3"], hue=y, showfliers=False)
plt.show()

Model training and analysis (1 point)¶

In [23]:
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
In [24]:
get_df_info(X)
Out[24]:
dtype nunique nan zero empty string example(-s) mode, mode proportion trash_score
pageview_duration_sec_last float64 710 n: 0.804 z: 0.001 NaN (115.0, 183.0) (6.0, 0.048) 0.805
page_type_6 float64 37 n: 0.006 z: 0.673 NaN (31.0, 13.0) (0.0, 0.677) 0.679
page_type_6_proportion float64 832 n: 0.006 z: 0.673 NaN (0.225, 0.13846153846153847) (0.0, 0.677) 0.679
rating_value float64 11 n: 0.677 NaN NaN (6.0, 10.0) (5.0, 0.672) 0.677
rating_count float64 30 n: 0.507 z: 0.136 NaN (35.0, 13.0) (1.0, 0.285) 0.642
description_length int64 3106 NaN z: 0.384 NaN (802.0, 43.0) (0.0, 0.384) 0.384
page_type_3_proportion float64 1253 n: 0.006 z: 0.375 NaN (0.2459016393442623, 0.13690476190476192) (0.0, 0.377) 0.381
page_type_3 float64 54 n: 0.006 z: 0.375 NaN (17.0, 56.0) (0.0, 0.377) 0.381
pageview_duration_sec_min float64 1346 n: 0.102 z: 0.049 NaN (903.0, 136.0) (4.0, 0.086) 0.151
datetime_diff float64 83298 n: 0.006 z: 0.098 NaN (1993.894, 2653.873) (0.0, 0.099) 0.104
pageview_duration_sec_max float64 1875 n: 0.102 z: 0.001 NaN (13393.0, 621.0) (30.0, 0.003) 0.103
timedelta_1 float64 56348 n: 0.006 z: 0.097 NaN (245.52999999999997, 707.4899999999998) (0.0, 0.098) 0.103
pageview_duration_sec_sum float64 6252 n: 0.006 z: 0.097 NaN (821.0, 3399.0) (0.0, 0.097) 0.103
utm_source float64 289 n: 0.1 NaN NaN (6.0, 227.0) (1.0, 0.476) 0.100
price_min float64 5712 n: 0.081 NaN NaN (316.0, 8213.0) (264.0, 0.005) 0.081
price_max float64 7489 n: 0.081 NaN NaN (1719.0, 7930.0) (952.0, 0.005) 0.081
last_page_type float64 14 n: 0.006 NaN NaN (10.0, 11.0) (1.0, 0.437) 0.006
session_length float64 253 n: 0.006 NaN NaN (208.0, 12.0) (1.0, 0.099) 0.006
timedelta_2 float64 99461 n: 0.006 NaN NaN (1854.56, 2090.04) (581.183, 0.0) 0.006
last_pageview_number float64 253 n: 0.006 NaN NaN (155.0, 8.0) (1.0, 0.098) 0.006
timedelta_3 float64 96662 n: 0.006 NaN NaN (13.903, -2965.903) (37.277, 0.0) 0.006
pics_qty int64 34 NaN z: 0.005 NaN (19.0, 16.0) (1.0, 0.366) 0.005
good_id int64 53691 NaN NaN NaN (59724690.0, 32240662.0) (66921494.0, 0.002) NaN
price int64 6362 NaN NaN NaN (2444.0, 152.0) (264.0, 0.009) NaN
utm_medium int64 8 NaN NaN NaN (1.0, 6.0) (1.0, 0.457) NaN
category_id int64 1733 NaN NaN NaN (7178.0, 3554.0) (155.0, 0.09) NaN
parent_id int64 368 NaN NaN NaN (1542.0, 7948.0) (154.0, 0.094) NaN
root_id int64 26 NaN NaN NaN (1481.0, 2303.0) (1183.0, 0.264) NaN
model_id int64 37299 NaN NaN NaN (18677008.0, 4142405.0) (18340251.0, 0.005) NaN
goods_qty int64 114 NaN NaN NaN (55.0, 8.0) (1.0, 0.319) NaN
In [25]:
cat_features = [
    "utm_medium",
    # "good_id",
    # "category_id",
    # "parent_id",
    "root_id",
    # "model_id",
    "last_page_type",
]
In [26]:
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.015987 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5261
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962162
[21]	Validation's auc: 0.963195
[24]	Validation's auc: 0.963843
[27]	Validation's auc: 0.964151
[30]	Validation's auc: 0.964368
[33]	Validation's auc: 0.964561
[36]	Validation's auc: 0.964596
[39]	Validation's auc: 0.964716
[42]	Validation's auc: 0.964841
[45]	Validation's auc: 0.9649
[48]	Validation's auc: 0.965022
[51]	Validation's auc: 0.965084
[54]	Validation's auc: 0.965132
[57]	Validation's auc: 0.965187
[60]	Validation's auc: 0.965206
[63]	Validation's auc: 0.965315
[66]	Validation's auc: 0.965247
[69]	Validation's auc: 0.965418
[72]	Validation's auc: 0.965437
[75]	Validation's auc: 0.965394
[78]	Validation's auc: 0.965377
[81]	Validation's auc: 0.965421
[84]	Validation's auc: 0.965449
[87]	Validation's auc: 0.965543
[90]	Validation's auc: 0.965529
[93]	Validation's auc: 0.965575
[96]	Validation's auc: 0.965572
[99]	Validation's auc: 0.965552
In [27]:
plot_scores(model, X_tr, y_tr, X_val, y_val)

Note the spikes in the middle of raw_scores; I will come back to them later

In [28]:
plot_tree_info(t)
In [ ]:
plot_feature_info(t)


In [30]:
t.query("split_feature == 'page_type_3'")
Out[30]:
tree_index node_depth node_index left_child right_child parent_index split_feature split_gain threshold decision_type missing_direction missing_type value weight count
0 0 1 0-S0 0-S1 0-S2 None page_type_3 26844.900391 0.0 <= left NaN -0.321022 0.0000 78446
61 1 1 1-S0 1-S1 1-S2 None page_type_3 19225.099609 0.0 <= left NaN 0.000000 0.0000 78446
122 2 1 2-S0 2-S1 2-S2 None page_type_3 14347.599609 0.0 <= left NaN 0.000000 0.0000 78446
183 3 1 3-S0 3-S1 3-S2 None page_type_3 10951.400391 0.0 <= left NaN 0.000000 0.0000 78446
244 4 1 4-S0 4-S1 4-S2 None page_type_3 8475.990234 0.0 <= left NaN 0.000000 0.0000 78446
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5795 95 1 95-S0 95-S1 95-S2 None page_type_3 104.882004 0.0 <= left NaN 0.000000 0.0000 78446
5843 95 9 95-S24 95-L11 95-S25 95-S23 page_type_3 10.002700 1.5 <= left NaN -0.002012 327.6110 3022
5856 96 1 96-S0 96-S1 96-S2 None page_type_3 35.998798 0.0 <= left NaN 0.000000 0.0000 78446
5917 97 1 97-S0 97-S1 97-S2 None page_type_3 121.742996 0.0 <= left NaN 0.000000 0.0000 78446
6011 98 9 98-S15 98-L10 98-S16 98-S13 page_type_3 10.774100 1.5 <= left NaN -0.022046 82.1623 938

99 rows × 15 columns

page_type_3 is mostly used only in the first split, with a threshold of 0. It may be worth binarizing it so the model does not overfit.
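A sketch of that binarization (hypothetical feature name has_page_type_3; toy data). Since almost every split tests page_type_3 <= 0, a 0/1 flag may carry most of the signal:

```python
import pandas as pd

demo = pd.DataFrame({"page_type_3": [0.0, 2.0, 0.0, 5.0]})

# Replace the count with a binary "was there at least one page of type 3" flag.
demo["has_page_type_3"] = (demo["page_type_3"] > 0).astype(int)
print(demo["has_page_type_3"].tolist())  # [0, 1, 0, 1]
```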

In [31]:
t.query("split_feature == 'timedelta_3'")
Out[31]:
tree_index node_depth node_index left_child right_child parent_index split_feature split_gain threshold decision_type missing_direction missing_type value weight count
1 0 2 0-S1 0-S10 0-S9 0-S0 timedelta_3 10157.799805 -2164.93 <= left NaN -0.192504 6872.6400 29890
3 0 4 0-S11 0-S16 0-S18 0-S10 timedelta_3 172.850006 -5370.1865 <= right NaN -0.286102 2387.1500 10382
20 0 5 0-S22 0-L16 0-S26 0-S15 timedelta_3 51.521900 -1955.455 <= left NaN -0.095339 3621.8800 15752
30 0 4 0-S14 0-L1 0-L15 0-S3 timedelta_3 146.151001 -539.3265 <= left NaN -0.433084 2299.7700 10002
34 0 5 0-S5 0-L4 0-S12 0-S4 timedelta_3 3377.439941 -1955.455 <= left NaN -0.215084 1462.5900 6361
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6057 99 6 99-S6 99-S18 99-S16 99-S5 timedelta_3 18.795799 -435.3185 <= left NaN 0.046040 491.6040 19541
6075 99 2 99-S2 99-S10 99-S7 99-S0 timedelta_3 18.351500 0.0 <= left NaN -0.036730 643.1310 25025
6077 99 4 99-S12 99-S20 99-L13 99-S10 timedelta_3 11.489500 -539.3265 <= left NaN -0.058996 91.6342 3624
6082 99 4 99-S11 99-S15 99-L12 99-S10 timedelta_3 15.029200 -198.09 <= left NaN 0.004037 249.5390 1625
6095 99 7 99-S28 99-L28 99-S29 99-S27 timedelta_3 17.743099 36.192 <= left NaN 0.132610 12.9024 382

732 rows × 15 columns

Oddly, the model uses timedelta_3 in every single tree, and never at the first level. And this is not an artifact of the small eta; it behaves this way every time.

The threshold is also different every time. It might be worth using a linear model on this feature

In [ ]:
px.scatter(X, x="timedelta_3", y=X["page_type_3"] > 0, color=y).update_layout(
    yaxis_title="page_type_3 > 0",
)


But it doesn't look like a linear model would be optimal here :/

Still, you can see how timedelta_3 helped in the region where page_type_3 > 0. I think that with some data cleaning a higher score is achievable!

Also, it may be better not to put too many trees into the model, so that it does not start capturing the specks of blue and yellow points. I suspect that is exactly what it is doing, which is why it uses timedelta_3 in every tree

With that in mind, I will try to build a stable model and pick it as my second submission for the private leaderboard

Scored block (26 points)¶

1. Dimensionality reduction (5 points)¶

In [34]:
raise Exception  # So the cells below are not re-run
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 raise Exception  # So the cells below are not re-run

Exception: 
An attempt at dimensionality reduction on features that represent the objects well¶

The training set has many categorical columns, and they are not interpretable (at least not by me), because we were given no information about what their values mean :(

The tree split utm_medium into the groups (1, 3, 4, 5) and (6, 7); we can use that

In [35]:
model.trees_to_dataframe().query("split_feature == 'utm_medium'")[:5]
Out[35]:
tree_index node_depth node_index left_child right_child parent_index split_feature split_gain threshold decision_type missing_direction missing_type value weight count
2 0 3 0-S10 0-S11 0-S17 0-S1 utm_medium 280.731995 1||3||4||5 == right NaN -0.300278 3196.040 13900
63 1 3 1-S9 1-S22 1-S11 1-S1 utm_medium 202.996994 6||7 == right NaN 0.014432 3225.800 13900
124 2 3 2-S10 2-S14 2-S20 2-S1 utm_medium 155.783005 1||3||4||5 == right NaN 0.018100 3244.450 13900
132 2 4 2-S20 2-S22 2-L21 2-S10 utm_medium 44.249699 2||7 == right NaN -0.021110 793.688 3518
185 3 3 3-S11 3-S14 3-S21 3-S1 utm_medium 117.456001 1||3||4||5 == right NaN 0.015245 3306.490 14101
In [36]:
features = [
    "page_type_3",
    "timedelta_3",
    "timedelta_1",
    "pageview_duration_sec_last",
    "page_type_6",
    "timedelta_2",
    "price",
    "pageview_duration_sec_max",
    "price_min",
    "page_type_3_proportion",
    "utm_medium_score",
    "price_max",
    "pageview_duration_sec_min",
    "pageview_duration_sec_sum",
    "page_type_6_proportion",
]


def transform_dr(X: pd.DataFrame, y: np.ndarray = None):
    data = X.copy()
    utm_medium_mapper = {
        1: 100,
        2: 1,
        3: 100,
        4: 100,
        5: 100,
        6: 10,
        7: 10,
        8: 1,
    }
    data["utm_medium_score"] = data["utm_medium"].map(utm_medium_mapper)
    if y is not None:
        data["is_callcenter"] = y

    data.fillna(-100, inplace=True)

    return data


get_df_info(transform_dr(X, y))
Out[36]:
dtype nunique nan zero empty string example(-s) mode, mode proportion trash_score
pageview_duration_sec_last float64 710 NaN z: 0.001 NaN (201.0, 189.0) (-100.0, 0.804) 0.804
page_type_6 float64 37 NaN z: 0.673 NaN (18.0, 15.0) (0.0, 0.673) 0.673
page_type_6_proportion float64 832 NaN z: 0.673 NaN (0.14754098360655737, 0.13846153846153847) (0.0, 0.673) 0.673
is_callcenter int64 2 NaN z: 0.645 NaN (0.0, 1.0) (0.0, 0.645) 0.645
description_length int64 3106 NaN z: 0.384 NaN (802.0, 43.0) (0.0, 0.384) 0.384
page_type_3 float64 54 NaN z: 0.375 NaN (12.0, 35.0) (0.0, 0.375) 0.375
page_type_3_proportion float64 1253 NaN z: 0.375 NaN (0.22058823529411764, 0.13690476190476192) (0.0, 0.375) 0.375
rating_count float64 30 NaN z: 0.136 NaN (34.0, 20.0) (-100.0, 0.507) 0.136
datetime_diff float64 83298 NaN z: 0.098 NaN (1993.894, 1852.664) (0.0, 0.098) 0.098
pageview_duration_sec_sum float64 6252 NaN z: 0.097 NaN (3139.0, 3353.0) (0.0, 0.097) 0.097
timedelta_1 float64 56347 NaN z: 0.097 NaN (-0.15699999999998226, -0.4860000000001037) (0.0, 0.097) 0.097
pageview_duration_sec_min float64 1346 NaN z: 0.049 NaN (664.0, 665.0) (-100.0, 0.102) 0.049
pics_qty int64 34 NaN z: 0.005 NaN (19.0, 16.0) (1.0, 0.366) 0.005
pageview_duration_sec_max float64 1875 NaN z: 0.001 NaN (6573.0, 1189.0) (-100.0, 0.102) 0.001
good_id int64 53691 NaN NaN NaN (59724690.0, 32240662.0) (66921494.0, 0.002) NaN
price int64 6362 NaN NaN NaN (2444.0, 152.0) (264.0, 0.009) NaN
utm_medium int64 8 NaN NaN NaN (1.0, 6.0) (1.0, 0.457) NaN
utm_source float64 289 NaN NaN NaN (35.0, 273.0) (1.0, 0.428) NaN
category_id int64 1733 NaN NaN NaN (7178.0, 3554.0) (155.0, 0.09) NaN
parent_id int64 368 NaN NaN NaN (1542.0, 7948.0) (154.0, 0.094) NaN
root_id int64 26 NaN NaN NaN (1481.0, 2303.0) (1183.0, 0.264) NaN
model_id int64 37299 NaN NaN NaN (18677008.0, 4142405.0) (18340251.0, 0.005) NaN
rating_value float64 11 NaN NaN NaN (4.0, 5.0) (-100.0, 0.677) NaN
goods_qty int64 114 NaN NaN NaN (55.0, 8.0) (1.0, 0.319) NaN
session_length float64 253 NaN NaN NaN (383.0, 12.0) (1.0, 0.098) NaN
last_page_type float64 14 NaN NaN NaN (-100.0, 12.0) (1.0, 0.435) NaN
last_pageview_number float64 253 NaN NaN NaN (265.0, 8.0) (1.0, 0.097) NaN
price_min float64 5712 NaN NaN NaN (1774.0, 2098.0) (-100.0, 0.081) NaN
price_max float64 7489 NaN NaN NaN (16529.0, 2992.0) (-100.0, 0.081) NaN
timedelta_2 float64 99461 NaN NaN NaN (364.843, 4837.257) (-100.0, 0.006) NaN
timedelta_3 float64 96662 NaN NaN NaN (-159.94, -2965.903) (-100.0, 0.006) NaN
utm_medium_score int64 3 NaN NaN NaN (100.0, 1.0) (100.0, 0.734) NaN
In [ ]:
data = transform_dr(X_tr, y_tr)

mapper_dict = {
    "TSNE 2D": {
        "params": {
            "perplexity": 30,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
    "UMAP 2D without y": {
        "params": {
            "n_neighbors": 15,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_umap,
    },
    "UMAP 2D with y": {
        "params": {
            "n_neighbors": 15,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": partial(make_umap, y=data["is_callcenter"]),
    },
}

drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
    data,
    mapper_dict,
    default_features=features,
    default_hue_info=("is_callcenter", True),
)

No description has been provided for this image

The pictures for the different is_callcenter values come out almost identical :(. Only UMAP with labels works reasonably well.

There are also long tails, which most likely come from outliers in the data (several sessions merged into one, and negative values). Those can be fixed by hand.
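One way to fix them by hand is winsorizing: clamp the negatives to zero and cap the upper tail at a quantile. A minimal sketch on a hypothetical column (the values below are made up, showing both artifacts):

```python
import pandas as pd

# Hypothetical timedelta-like column: one negative value and one extreme
# value from several sessions merged into one.
s = pd.Series([-120.0, 0.5, 3.0, 12.0, 40.0, 1e6])

# Clamp negatives to zero and cap the upper tail at the 99th percentile.
upper = s.quantile(0.99)
cleaned = s.clip(lower=0, upper=upper)
print(cleaned.tolist())
```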

In [46]:
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=200,
    gen_min_span_tree=True,
    prediction_data=True,
)

X_tr["cluster_id"] = clusterer.fit_predict(results["UMAP 2D with y"]["embedding"])
X_val["cluster_id"] = hdbscan.approximate_predict(
    clusterer,
    results["UMAP 2D with y"]["mapper"].transform(transform_dr(X_val)[features].values),
)[0]
In [47]:
train_dataset = lgb.Dataset(
    X_tr, y_tr, categorical_feature=cat_features + ["cluster_id"]
)
val_dataset = lgb.Dataset(
    X_val, y_val, categorical_feature=cat_features + ["cluster_id"]
)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.2,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001420 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5283
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 31
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.808559
[6]	Validation's auc: 0.809859
[9]	Validation's auc: 0.810023
[12]	Validation's auc: 0.798178
[15]	Validation's auc: 0.797292
[18]	Validation's auc: 0.889415
[21]	Validation's auc: 0.892297
[24]	Validation's auc: 0.890831
[27]	Validation's auc: 0.887568
[30]	Validation's auc: 0.889944
[33]	Validation's auc: 0.888973
[36]	Validation's auc: 0.886435
[39]	Validation's auc: 0.889596
[42]	Validation's auc: 0.889336
[45]	Validation's auc: 0.889704
[48]	Validation's auc: 0.891321
[51]	Validation's auc: 0.891555
[54]	Validation's auc: 0.889247
[57]	Validation's auc: 0.888259
[60]	Validation's auc: 0.888992
[63]	Validation's auc: 0.887077
[66]	Validation's auc: 0.888238
[69]	Validation's auc: 0.886502
[72]	Validation's auc: 0.886932
[75]	Validation's auc: 0.890603
[78]	Validation's auc: 0.890796
[81]	Validation's auc: 0.890969
[84]	Validation's auc: 0.894233
[87]	Validation's auc: 0.893101
[90]	Validation's auc: 0.892958
[93]	Validation's auc: 0.891136
[96]	Validation's auc: 0.8913
[99]	Validation's auc: 0.890424
In [49]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_tr, model.predict(X_tr))
Out[49]:
0.9986133563103403

We overfit far too much because of the target leak in UMAP :(
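To avoid this kind of leak, the embedding used to derive features has to be fitted without the target and on the training fold only, then reused on validation. A minimal sketch with PCA and KMeans standing in for UMAP and HDBSCAN (synthetic data; all names here are hypothetical):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

Xs, _ = make_classification(n_samples=2000, n_features=8, random_state=0)
Xa, Xb = train_test_split(Xs, test_size=0.25, random_state=0)

# Fit the reducer and the clusterer on the training fold only, with no
# target -- then the derived cluster_id cannot memorize labels.
pca = PCA(n_components=2).fit(Xa)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pca.transform(Xa))

cluster_tr = km.predict(pca.transform(Xa))   # train-fold feature
cluster_val = km.predict(pca.transform(Xb))  # reuse, never refit, on validation
```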

In [50]:
X_tr.drop(columns=["cluster_id"], inplace=True, errors="ignore")
X_val.drop(columns=["cluster_id"], inplace=True, errors="ignore")
What about something other than is_callcenter¶
In [ ]:
data = transform_dr(X_tr, y_tr)

mapper_dict = {
    "TSNE 2D": {
        "params": {
            "perplexity": 30,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
    "UMAP 2D without y": {
        "params": {
            "n_neighbors": 15,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_umap,
    },
    "UMAP 2D with y": {
        "params": {
            "n_neighbors": 15,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": partial(make_umap, y=data["utm_medium"]),
    },
}

drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
    data,
    mapper_dict,
    default_features=features,
    default_hue_info=("utm_medium", True),
)

No description has been provided for this image

The same tails again, but utm_medium is scattered almost uniformly

In [ ]:
data = transform_dr(X_tr, y_tr)

mapper_dict = {
    "TSNE 2D": {
        "params": {
            "perplexity": 30,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
    "UMAP 2D without y": {
        "params": {
            "n_neighbors": 15,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_umap,
    },
}

drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
    data,
    mapper_dict,
    default_features=features,
    default_hue_info=("root_id", False),
)

No description has been provided for this image

Same story here :/

Let's look at the scores¶
In [54]:
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.073822 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5261
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962162
[21]	Validation's auc: 0.963195
[24]	Validation's auc: 0.963843
[27]	Validation's auc: 0.964151
[30]	Validation's auc: 0.964368
[33]	Validation's auc: 0.964561
[36]	Validation's auc: 0.964596
[39]	Validation's auc: 0.964716
[42]	Validation's auc: 0.964841
[45]	Validation's auc: 0.9649
[48]	Validation's auc: 0.965022
[51]	Validation's auc: 0.965084
[54]	Validation's auc: 0.965132
[57]	Validation's auc: 0.965187
[60]	Validation's auc: 0.965206
[63]	Validation's auc: 0.965315
[66]	Validation's auc: 0.965247
[69]	Validation's auc: 0.965418
[72]	Validation's auc: 0.965437
[75]	Validation's auc: 0.965394
[78]	Validation's auc: 0.965377
[81]	Validation's auc: 0.965421
[84]	Validation's auc: 0.965449
[87]	Validation's auc: 0.965543
[90]	Validation's auc: 0.965529
[93]	Validation's auc: 0.965575
[96]	Validation's auc: 0.965572
[99]	Validation's auc: 0.965552
In [55]:
leaves = model.predict(X_val, pred_leaf=True)

scores = np.array(
    [
        [model.get_leaf_output(i, leaves[j, i]) for i in range(leaves.shape[1])]
        for j in range(leaves.shape[0])
    ]
)
scores = pd.DataFrame(scores, columns=map(str, range(100)))
scores["is_callcenter"] = y_val
In [ ]:
mapper_dict = {
    "TSNE 2D": {
        "params": {
            "perplexity": 20,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
    "UMAP 2D": {
        "params": {
            "n_neighbors": 11,
            "min_dist": 0.1,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_umap,
    },
}
scores_drplotter = DimReductionPlotter()
_ = scores_drplotter.plot_dim_reduction(
    scores, mapper_dict, list(map(str, range(100))), ("is_callcenter", True)
)

No description has been provided for this image

Both methods show an overlap between the two classes, while everything else separates reasonably well. The overlapping points correspond to the peaks in the raw_score histogram where the model is unsure of the label.

A little cloud on the sessions :-)¶
In [59]:
web_aggregate_features = [
    "session_length",
    "page_type_3",
    "page_type_6",
    "last_pageview_number",
    # "pageview_duration_sec_last",
    "pageview_duration_sec_sum",
    # "pageview_duration_sec_min",
    # "pageview_duration_sec_max",
    # "price_min",
    # "price_max",
    "datetime_diff",
    "timedelta_1",
    "page_type_3_proportion",
    "page_type_6_proportion",
]
In [ ]:
mapper_dict = {
    "TSNE 2D perplexity=15": {
        "params": {
            "perplexity": 15,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
    "TSNE 2D perplexity=30": {
        "params": {
            "perplexity": 30,
            "n_components": 2,
            #
            "n_jobs": 32,
            "verbose": False,
        },
        "func": make_tsne,
    },
}

drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
    web_aggregate[
        (web_aggregate.index.isin(set(tr["sessionkey_id"]) | set(tst["sessionkey_id"])))
    ],
    mapper_dict,
    web_aggregate_features,
)

No description has been provided for this image

Something interesting! A single cloud of points

In [63]:
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=100,
    gen_min_span_tree=True,
    prediction_data=True,
)

clusters = clusterer.fit_predict(results["TSNE 2D perplexity=30"]["embedding"])
In [65]:
sns.scatterplot(
    x=results["TSNE 2D perplexity=30"]["embedding"][:, 0],
    y=results["TSNE 2D perplexity=30"]["embedding"][:, 1],
    hue=clusters,
)
plt.show()
No description has been provided for this image
In [66]:
web_aggregate.loc[
    (web_aggregate.index.isin(set(tr["sessionkey_id"]) | set(tst["sessionkey_id"]))),
    "cluster_id",
] = clusters
In [67]:
X, y = transform(tr, web_aggregate)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)

train_dataset = lgb.Dataset(
    X_tr, y_tr, categorical_feature=cat_features + ["cluster_id"]
)
val_dataset = lgb.Dataset(
    X_val, y_val, categorical_feature=cat_features + ["cluster_id"]
)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003494 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5268
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 31
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962162
[21]	Validation's auc: 0.963195
[24]	Validation's auc: 0.963843
[27]	Validation's auc: 0.964151
[30]	Validation's auc: 0.964368
[33]	Validation's auc: 0.964561
[36]	Validation's auc: 0.964596
[39]	Validation's auc: 0.964716
[42]	Validation's auc: 0.964841
[45]	Validation's auc: 0.9649
[48]	Validation's auc: 0.965022
[51]	Validation's auc: 0.965084
[54]	Validation's auc: 0.965132
[57]	Validation's auc: 0.965187
[60]	Validation's auc: 0.965206
[63]	Validation's auc: 0.965315
[66]	Validation's auc: 0.965247
[69]	Validation's auc: 0.965418
[72]	Validation's auc: 0.965437
[75]	Validation's auc: 0.965394
[78]	Validation's auc: 0.965377
[81]	Validation's auc: 0.965421
[84]	Validation's auc: 0.965449
[87]	Validation's auc: 0.965543
[90]	Validation's auc: 0.965529
[93]	Validation's auc: 0.965575
[96]	Validation's auc: 0.965572
[99]	Validation's auc: 0.965552
In [68]:
dict(zip(X_tr.columns, model.feature_importance("gain")))["cluster_id"]
Out[68]:
0.0

Not a particularly useful feature :* (On my first run it at least split on it once)

In [70]:
web_aggregate.drop(columns=["cluster_id"], inplace=True, errors="ignore")
X, y = transform(tr, web_aggregate)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
Mini summary¶

Overall, it seems to me there is not much point in reducing dimensionality or clustering on this data. The key information lives in a handful of variables on which it is hard to define a sensible metric, because they are not numeric.

But I tried.

2. Clustering (3 points)¶

My model did not use the product information from the sessions, and I want to fix that.

Since these are categorical features, doing this metrically is hard (I tried the Jaccard and Hamming distances, but the distance matrix contained only two distinct values: 0 on the diagonal and 1 everywhere else :/), so I decided to try a tree for this instead.
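For reference, this is the degeneracy in question: with high-cardinality IDs, two sessions almost never share an item, so nearly every off-diagonal Jaccard distance collapses to 1. A minimal sketch (the item IDs below are made up):

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# With tens of thousands of distinct good_id / model_id values, most
# session pairs are disjoint, so the distance matrix is almost all 1s.
s1 = {722, 7196, 779}
s2 = {722, 7196, 123}
s3 = {9401923}
print(jaccard_distance(s1, s2))  # shared items -> distance below 1
print(jaccard_distance(s1, s3))  # disjoint -> exactly 1
```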

Clustering via leaves¶
In [71]:
web_cat = web.loc[
    web["page_type"].isin([1, 2, 4]),
    ["sessionkey_id", "category_id", "model_id", "good_id", "price"],
].set_index("sessionkey_id")
web_cat
Out[71]:
category_id model_id good_id price
sessionkey_id
109996122 722.0 NaN NaN NaN
109996122 7196.0 NaN NaN NaN
109996122 779.0 NaN NaN NaN
109996122 7196.0 NaN NaN NaN
109996122 723.0 NaN NaN NaN
... ... ... ... ...
134628743 127.0 NaN NaN NaN
134628743 127.0 9401923.0 NaN NaN
134628743 127.0 NaN NaN NaN
134628743 127.0 17200183.0 NaN NaN
134629277 NaN NaN NaN NaN

2483376 rows × 4 columns

In [72]:
cluster_tr = tr.join(web_cat, "sessionkey_id", "inner", lsuffix="_tr")[
    [
        "category_id",
        "model_id",
        "good_id",
        "price",
        "is_callcenter",
    ]
]
X_cluster, y_cluster = (
    cluster_tr.drop(columns="is_callcenter"),
    cluster_tr.is_callcenter.values,
)
In [73]:
tree = lgb.train({"num_leaves": 10}, lgb.Dataset(X_cluster, y_cluster), 1)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002363 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1020
[LightGBM] [Info] Number of data points in the train set: 1046532, number of used features: 4
[LightGBM] [Info] Start training from score 0.285141
In [74]:
tree.trees_to_dataframe()
Out[74]:
tree_index node_depth node_index left_child right_child parent_index split_feature split_gain threshold decision_type missing_direction missing_type value weight count
0 0 1 0-S0 0-S1 0-S5 None price 1464.050049 793.5 <= left NaN 0.285141 0 1046532
1 0 2 0-S1 0-S2 0-L2 0-S0 category_id 635.273010 3974.5 <= right NaN 0.282752 743326 743326
2 0 3 0-S2 0-S4 0-S3 0-S1 category_id 278.243988 132.5 <= left NaN 0.284679 518231 518231
3 0 4 0-S4 0-L0 0-L5 0-S2 category_id 139.705994 125.5 <= left NaN 0.293974 30318 30318
4 0 5 0-L0 None None 0-S4 None NaN NaN None None None 0.280335 6019 6019
5 0 5 0-L5 None None 0-S4 None NaN NaN None None None 0.297353 24299 24299
6 0 4 0-S3 0-L3 0-S6 0-S2 price 175.552002 541.5 <= right NaN 0.284101 487913 487913
7 0 5 0-L3 None None 0-S3 None NaN NaN None None None 0.280539 107786 107786
8 0 5 0-S6 0-S7 0-L7 0-S3 category_id 113.382004 1326.5 <= left NaN 0.285111 380127 380127
9 0 6 0-S7 0-L4 0-L8 0-S6 category_id 145.516006 1265.5 <= left NaN 0.283248 175707 175707
10 0 7 0-L4 None None 0-S7 None NaN NaN None None None 0.284015 164063 164063
11 0 7 0-L8 None None 0-S7 None NaN NaN None None None 0.272446 11644 11644
12 0 6 0-L7 None None 0-S6 None NaN NaN None None None 0.286712 204420 204420
13 0 3 0-L2 None None 0-S1 None NaN NaN None None None 0.278316 225095 225095
14 0 2 0-S5 0-L1 0-S8 0-S0 model_id 122.767998 5943291.5 <= left NaN 0.290997 303206 303206
15 0 3 0-L1 None None 0-S5 None NaN NaN None None None 0.293285 132269 132269
16 0 3 0-S8 0-L6 0-L9 0-S5 category_id 87.764198 7063.5 <= left NaN 0.289227 170937 170937
17 0 4 0-L6 None None 0-S8 None NaN NaN None None None 0.289540 167738 167738
18 0 4 0-L9 None None 0-S8 None NaN NaN None None None 0.272819 3199 3199

Let's add the cluster label to each session pageview it fell into

In [75]:
web.loc[web["page_type"].isin([1, 2, 4]), "cluster_id"] = tree.predict(
    web_cat, pred_leaf=True
)
In [76]:
web.head()
Out[76]:
sessionkey_id date_time page_type pageview_number pageview_duration_sec category_id model_id good_id price product_in_sale cluster_id
2268917 109996122 1975-10-17 13:42:56.953 2 1 11.0 722.0 NaN NaN NaN NaN 4.0
2268918 109996122 1975-10-17 13:43:07.510 2 2 22.0 7196.0 NaN NaN NaN NaN 2.0
2268919 109996122 1975-10-17 13:43:29.860 2 3 25.0 779.0 NaN NaN NaN NaN 4.0
2269206 109996122 1975-10-17 13:43:54.757 2 4 9.0 7196.0 NaN NaN NaN NaN 2.0
2267445 109996122 1975-10-17 13:44:03.803 2 5 11.0 723.0 NaN NaN NaN NaN 4.0

Now let's aggregate the cluster info

In [77]:
agg_params = {
    "session_length": ("sessionkey_id", lambda x: x.shape[0]),
    #
    "session_datetime_start": ("date_time", lambda x: x.iloc[0]),
    "session_datetime_end": ("date_time", lambda x: x.iloc[-1]),
    #
    "last_page_type": ("page_type", lambda x: x.iloc[-1]),
    **{
        f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
        for i in (3, 6)
    },
    #
    # **{
    #     f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
    #     for i in range(1, 13 + 1)
    #     }, # Types 3 and 6 turned out the most useful; I kept only these to save compute
    #
    #
    "last_pageview_number": ("pageview_number", lambda x: x.max()),
    #
    "pageview_duration_sec_last": ("pageview_duration_sec", lambda x: x.iloc[-1]),
    "pageview_duration_sec_sum": ("pageview_duration_sec", lambda x: np.nansum(x)),
    "pageview_duration_sec_min": ("pageview_duration_sec", lambda x: x.min()),
    "pageview_duration_sec_max": ("pageview_duration_sec", lambda x: x.max()),
    #
    "categories": ("category_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "models": ("model_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "goods": ("good_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "price_min": ("price", lambda x: x.min()),
    "price_max": ("price", lambda x: x.max()),
    #
    **{
        f"cluster_id_{i}": ("cluster_id", partial(lambda x, i: x[x == i].count(), i=i))
        for i in range(10)
    },
}

web_aggregate = web.groupby("sessionkey_id", sort=False).agg(**agg_params)
In [78]:
web_aggregate["datetime_diff"] = (
    web_aggregate["session_datetime_end"] - web_aggregate["session_datetime_start"]
).dt.total_seconds()
web_aggregate["timedelta_1"] = (
    web_aggregate["datetime_diff"] - web_aggregate["pageview_duration_sec_sum"]
)

for i in (3, 6):
    web_aggregate[f"page_type_{i}_proportion"] = (
        web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
    )

# for i in range(1, 13 + 1):
#     web_aggregate[f"page_type_{i}_proportion"] = (
#         web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
#     )

web_aggregate.sample(5)
Out[78]:
session_length session_datetime_start session_datetime_end last_page_type page_type_3 page_type_6 last_pageview_number pageview_duration_sec_last pageview_duration_sec_sum pageview_duration_sec_min ... cluster_id_4 cluster_id_5 cluster_id_6 cluster_id_7 cluster_id_8 cluster_id_9 datetime_diff timedelta_1 page_type_3_proportion page_type_6_proportion
sessionkey_id
120615479 17 1975-12-10 12:22:21.907 1975-12-10 12:45:09.480 1 0 0 17 NaN 1368.0 9.0 ... 7 0 3 2 0 0 1367.573 -0.427 0.000000 0.000000
126391440 1 1976-01-09 10:45:30.097 1976-01-09 10:45:30.097 6 0 1 4 NaN 0.0 NaN ... 0 0 0 0 0 0 0.000 0.000 0.000000 1.000000
117443192 11 1975-11-25 14:15:39.247 1975-11-25 14:22:51.317 2 0 0 11 NaN 432.0 5.0 ... 0 0 0 6 0 0 432.070 0.070 0.000000 0.000000
115670979 30 1975-11-16 10:27:49.977 1975-11-16 10:39:49.263 2 1 1 35 NaN 534.0 0.0 ... 0 0 0 0 0 1 719.286 185.286 0.033333 0.033333
129596506 3 1976-01-23 07:54:26.957 1976-01-23 08:11:13.260 9 0 0 3 NaN 1007.0 16.0 ... 0 0 0 0 0 0 1006.303 -0.697 0.000000 0.000000

5 rows × 30 columns

In [79]:
X, y = transform(tr, web_aggregate)
In [80]:
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
In [81]:
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)

model_clusters = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    300,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t_clusters = model_clusters.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.093527 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5760
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962166
[21]	Validation's auc: 0.963215
[24]	Validation's auc: 0.96378
[27]	Validation's auc: 0.963956
[30]	Validation's auc: 0.964226
[33]	Validation's auc: 0.964413
[36]	Validation's auc: 0.964544
[39]	Validation's auc: 0.964682
[42]	Validation's auc: 0.964783
[45]	Validation's auc: 0.964912
[48]	Validation's auc: 0.964983
[51]	Validation's auc: 0.964963
[54]	Validation's auc: 0.965076
[57]	Validation's auc: 0.965076
[60]	Validation's auc: 0.965144
[63]	Validation's auc: 0.965155
[66]	Validation's auc: 0.965159
[69]	Validation's auc: 0.965136
[72]	Validation's auc: 0.965118
[75]	Validation's auc: 0.965151
[78]	Validation's auc: 0.965196
[81]	Validation's auc: 0.964963
[84]	Validation's auc: 0.965073
[87]	Validation's auc: 0.965156
[90]	Validation's auc: 0.965156
[93]	Validation's auc: 0.965201
[96]	Validation's auc: 0.9652
[99]	Validation's auc: 0.965145
[102]	Validation's auc: 0.965152
[105]	Validation's auc: 0.965177
[108]	Validation's auc: 0.965248
[111]	Validation's auc: 0.965222
[114]	Validation's auc: 0.96521
[117]	Validation's auc: 0.965241
[120]	Validation's auc: 0.96526
[123]	Validation's auc: 0.965241
[126]	Validation's auc: 0.965242
[129]	Validation's auc: 0.965023
[132]	Validation's auc: 0.965168
[135]	Validation's auc: 0.965172
[138]	Validation's auc: 0.965179
[141]	Validation's auc: 0.965124
[144]	Validation's auc: 0.965155
[147]	Validation's auc: 0.965172
[150]	Validation's auc: 0.965205
[153]	Validation's auc: 0.965244
[156]	Validation's auc: 0.965233
[159]	Validation's auc: 0.965269
[162]	Validation's auc: 0.96539
[165]	Validation's auc: 0.965263
[168]	Validation's auc: 0.965247
[171]	Validation's auc: 0.965079
[174]	Validation's auc: 0.96517
[177]	Validation's auc: 0.965165
[180]	Validation's auc: 0.965123
[183]	Validation's auc: 0.965137
[186]	Validation's auc: 0.965121
[189]	Validation's auc: 0.965145
[192]	Validation's auc: 0.965119
[195]	Validation's auc: 0.965188
[198]	Validation's auc: 0.965137
[201]	Validation's auc: 0.965055
[204]	Validation's auc: 0.965114
[207]	Validation's auc: 0.965174
[210]	Validation's auc: 0.965186
[213]	Validation's auc: 0.9652
[216]	Validation's auc: 0.965047
[219]	Validation's auc: 0.965016
[222]	Validation's auc: 0.96492
[225]	Validation's auc: 0.964921
[228]	Validation's auc: 0.964938
[231]	Validation's auc: 0.964894
[234]	Validation's auc: 0.964995
[237]	Validation's auc: 0.965042
[240]	Validation's auc: 0.96501
[243]	Validation's auc: 0.965039
[246]	Validation's auc: 0.965047
[249]	Validation's auc: 0.96511
[252]	Validation's auc: 0.965115
[255]	Validation's auc: 0.965069
[258]	Validation's auc: 0.965039
[261]	Validation's auc: 0.965056
[264]	Validation's auc: 0.965096
[267]	Validation's auc: 0.965034
[270]	Validation's auc: 0.965041
[273]	Validation's auc: 0.965008
[276]	Validation's auc: 0.965048
[279]	Validation's auc: 0.965023
[282]	Validation's auc: 0.964977
[285]	Validation's auc: 0.965005
[288]	Validation's auc: 0.964979
[291]	Validation's auc: 0.965066
[294]	Validation's auc: 0.965078
[297]	Validation's auc: 0.965108
[300]	Validation's auc: 0.965139
In [ ]:
plot_feature_info(t_clusters)

No description has been provided for this image

This idea also turned out not to be very useful :/

In [83]:
web_aggregate.drop(
    columns=[f"cluster_id_{i}" for i in range(10)], errors="ignore", inplace=True
)

3. Nearest neighbors (3 points)¶

Since dimensionality reduction produced little of value (because of the feature types), it does not seem reasonable to try nearest neighbors. For clustering one can still come up with something tree-based, but ordinary trees already do exactly that: they group similar objects together. So I won't reinvent the wheel :*

4. lightgbm: model.trees_to_dataframe (5 points)¶

Everything that belongs in this block lives in EDA -> Model training and analysis, because I use it in the first sections

5. catboost: model.get_object_importance (4 + 1 points)¶

In [85]:
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.010657 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5760
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962166
[21]	Validation's auc: 0.963215
[24]	Validation's auc: 0.96378
[27]	Validation's auc: 0.963956
[30]	Validation's auc: 0.964226
[33]	Validation's auc: 0.964413
[36]	Validation's auc: 0.964544
[39]	Validation's auc: 0.964682
[42]	Validation's auc: 0.964783
[45]	Validation's auc: 0.964912
[48]	Validation's auc: 0.964983
[51]	Validation's auc: 0.964963
[54]	Validation's auc: 0.965076
[57]	Validation's auc: 0.965076
[60]	Validation's auc: 0.965144
[63]	Validation's auc: 0.965155
[66]	Validation's auc: 0.965159
[69]	Validation's auc: 0.965136
[72]	Validation's auc: 0.965118
[75]	Validation's auc: 0.965151
[78]	Validation's auc: 0.965196
[81]	Validation's auc: 0.964963
[84]	Validation's auc: 0.965073
[87]	Validation's auc: 0.965156
[90]	Validation's auc: 0.965156
[93]	Validation's auc: 0.965201
[96]	Validation's auc: 0.9652
[99]	Validation's auc: 0.965145
In [86]:
plot_scores(model, X_tr, y_tr, X_val, y_val)
No description has been provided for this image
In [87]:
tr_pool = cb.Pool(X_tr, y_tr)
val_pool = cb.Pool(X_val, y_val)

catboost = cb.train(
    tr_pool,
    {"iterations": 100, "eval_metric": "AUC", "loss_function": "Logloss"},
    eval_set=val_pool,
)
Learning rate set to 0.253436
0:	test: 0.9505862	best: 0.9505862 (0)	total: 55.4ms	remaining: 5.48s
1:	test: 0.9523762	best: 0.9523762 (1)	total: 60.7ms	remaining: 2.97s
2:	test: 0.9563567	best: 0.9563567 (2)	total: 65.1ms	remaining: 2.1s
3:	test: 0.9567952	best: 0.9567952 (3)	total: 69.6ms	remaining: 1.67s
4:	test: 0.9586227	best: 0.9586227 (4)	total: 74.2ms	remaining: 1.41s
5:	test: 0.9594953	best: 0.9594953 (5)	total: 78.7ms	remaining: 1.23s
6:	test: 0.9600629	best: 0.9600629 (6)	total: 83.3ms	remaining: 1.11s
7:	test: 0.9607234	best: 0.9607234 (7)	total: 87.9ms	remaining: 1.01s
8:	test: 0.9607205	best: 0.9607234 (7)	total: 92.5ms	remaining: 935ms
9:	test: 0.9608740	best: 0.9608740 (9)	total: 97ms	remaining: 873ms
10:	test: 0.9611348	best: 0.9611348 (10)	total: 102ms	remaining: 822ms
11:	test: 0.9611966	best: 0.9611966 (11)	total: 106ms	remaining: 780ms
12:	test: 0.9614189	best: 0.9614189 (12)	total: 111ms	remaining: 742ms
13:	test: 0.9617673	best: 0.9617673 (13)	total: 115ms	remaining: 709ms
14:	test: 0.9620421	best: 0.9620421 (14)	total: 120ms	remaining: 682ms
15:	test: 0.9620531	best: 0.9620531 (15)	total: 125ms	remaining: 655ms
16:	test: 0.9624287	best: 0.9624287 (16)	total: 129ms	remaining: 632ms
17:	test: 0.9627692	best: 0.9627692 (17)	total: 134ms	remaining: 610ms
18:	test: 0.9628372	best: 0.9628372 (18)	total: 138ms	remaining: 589ms
19:	test: 0.9628297	best: 0.9628372 (18)	total: 142ms	remaining: 570ms
20:	test: 0.9628854	best: 0.9628854 (20)	total: 147ms	remaining: 552ms
21:	test: 0.9629522	best: 0.9629522 (21)	total: 151ms	remaining: 535ms
22:	test: 0.9628709	best: 0.9629522 (21)	total: 155ms	remaining: 520ms
23:	test: 0.9628682	best: 0.9629522 (21)	total: 160ms	remaining: 505ms
24:	test: 0.9629931	best: 0.9629931 (24)	total: 164ms	remaining: 492ms
25:	test: 0.9630040	best: 0.9630040 (25)	total: 168ms	remaining: 479ms
26:	test: 0.9630508	best: 0.9630508 (26)	total: 173ms	remaining: 467ms
27:	test: 0.9630695	best: 0.9630695 (27)	total: 177ms	remaining: 456ms
28:	test: 0.9631592	best: 0.9631592 (28)	total: 182ms	remaining: 445ms
29:	test: 0.9631399	best: 0.9631592 (28)	total: 187ms	remaining: 435ms
30:	test: 0.9631235	best: 0.9631592 (28)	total: 191ms	remaining: 425ms
31:	test: 0.9632781	best: 0.9632781 (31)	total: 196ms	remaining: 416ms
32:	test: 0.9632676	best: 0.9632781 (31)	total: 200ms	remaining: 406ms
33:	test: 0.9633529	best: 0.9633529 (33)	total: 205ms	remaining: 398ms
34:	test: 0.9634088	best: 0.9634088 (34)	total: 209ms	remaining: 388ms
35:	test: 0.9634287	best: 0.9634287 (35)	total: 213ms	remaining: 379ms
36:	test: 0.9634687	best: 0.9634687 (36)	total: 218ms	remaining: 371ms
37:	test: 0.9634698	best: 0.9634698 (37)	total: 222ms	remaining: 362ms
38:	test: 0.9634747	best: 0.9634747 (38)	total: 226ms	remaining: 354ms
39:	test: 0.9634713	best: 0.9634747 (38)	total: 231ms	remaining: 346ms
40:	test: 0.9635600	best: 0.9635600 (40)	total: 235ms	remaining: 339ms
41:	test: 0.9635121	best: 0.9635600 (40)	total: 240ms	remaining: 331ms
42:	test: 0.9635283	best: 0.9635600 (40)	total: 244ms	remaining: 324ms
43:	test: 0.9635406	best: 0.9635600 (40)	total: 248ms	remaining: 316ms
44:	test: 0.9631217	best: 0.9635600 (40)	total: 253ms	remaining: 309ms
45:	test: 0.9631152	best: 0.9635600 (40)	total: 258ms	remaining: 302ms
46:	test: 0.9630669	best: 0.9635600 (40)	total: 262ms	remaining: 296ms
47:	test: 0.9630985	best: 0.9635600 (40)	total: 266ms	remaining: 289ms
48:	test: 0.9631310	best: 0.9635600 (40)	total: 271ms	remaining: 282ms
49:	test: 0.9631896	best: 0.9635600 (40)	total: 275ms	remaining: 275ms
50:	test: 0.9631818	best: 0.9635600 (40)	total: 279ms	remaining: 269ms
51:	test: 0.9632181	best: 0.9635600 (40)	total: 284ms	remaining: 262ms
52:	test: 0.9632117	best: 0.9635600 (40)	total: 288ms	remaining: 255ms
53:	test: 0.9632418	best: 0.9635600 (40)	total: 292ms	remaining: 249ms
54:	test: 0.9631888	best: 0.9635600 (40)	total: 297ms	remaining: 243ms
55:	test: 0.9631792	best: 0.9635600 (40)	total: 301ms	remaining: 236ms
56:	test: 0.9633803	best: 0.9635600 (40)	total: 305ms	remaining: 230ms
57:	test: 0.9634897	best: 0.9635600 (40)	total: 310ms	remaining: 224ms
58:	test: 0.9635149	best: 0.9635600 (40)	total: 315ms	remaining: 219ms
59:	test: 0.9634975	best: 0.9635600 (40)	total: 319ms	remaining: 213ms
60:	test: 0.9634957	best: 0.9635600 (40)	total: 323ms	remaining: 207ms
61:	test: 0.9634748	best: 0.9635600 (40)	total: 327ms	remaining: 201ms
62:	test: 0.9634809	best: 0.9635600 (40)	total: 332ms	remaining: 195ms
63:	test: 0.9634680	best: 0.9635600 (40)	total: 336ms	remaining: 189ms
64:	test: 0.9634749	best: 0.9635600 (40)	total: 341ms	remaining: 183ms
65:	test: 0.9634842	best: 0.9635600 (40)	total: 345ms	remaining: 177ms
66:	test: 0.9635098	best: 0.9635600 (40)	total: 349ms	remaining: 172ms
67:	test: 0.9634321	best: 0.9635600 (40)	total: 353ms	remaining: 166ms
68:	test: 0.9634847	best: 0.9635600 (40)	total: 358ms	remaining: 161ms
69:	test: 0.9634847	best: 0.9635600 (40)	total: 362ms	remaining: 155ms
70:	test: 0.9635474	best: 0.9635600 (40)	total: 366ms	remaining: 149ms
71:	test: 0.9635476	best: 0.9635600 (40)	total: 370ms	remaining: 144ms
72:	test: 0.9635591	best: 0.9635600 (40)	total: 375ms	remaining: 139ms
73:	test: 0.9635429	best: 0.9635600 (40)	total: 379ms	remaining: 133ms
74:	test: 0.9635252	best: 0.9635600 (40)	total: 383ms	remaining: 128ms
75:	test: 0.9635370	best: 0.9635600 (40)	total: 388ms	remaining: 123ms
76:	test: 0.9635671	best: 0.9635671 (76)	total: 392ms	remaining: 117ms
77:	test: 0.9635759	best: 0.9635759 (77)	total: 397ms	remaining: 112ms
78:	test: 0.9635837	best: 0.9635837 (78)	total: 401ms	remaining: 107ms
79:	test: 0.9636266	best: 0.9636266 (79)	total: 405ms	remaining: 101ms
80:	test: 0.9636528	best: 0.9636528 (80)	total: 410ms	remaining: 96.1ms
81:	test: 0.9636463	best: 0.9636528 (80)	total: 414ms	remaining: 90.9ms
82:	test: 0.9636482	best: 0.9636528 (80)	total: 419ms	remaining: 85.8ms
83:	test: 0.9636153	best: 0.9636528 (80)	total: 423ms	remaining: 80.6ms
84:	test: 0.9636393	best: 0.9636528 (80)	total: 427ms	remaining: 75.4ms
85:	test: 0.9636746	best: 0.9636746 (85)	total: 432ms	remaining: 70.3ms
86:	test: 0.9636464	best: 0.9636746 (85)	total: 436ms	remaining: 65.1ms
87:	test: 0.9621741	best: 0.9636746 (85)	total: 440ms	remaining: 60ms
88:	test: 0.9621743	best: 0.9636746 (85)	total: 444ms	remaining: 54.9ms
89:	test: 0.9621426	best: 0.9636746 (85)	total: 448ms	remaining: 49.8ms
90:	test: 0.9621527	best: 0.9636746 (85)	total: 453ms	remaining: 44.8ms
91:	test: 0.9621559	best: 0.9636746 (85)	total: 457ms	remaining: 39.7ms
92:	test: 0.9621843	best: 0.9636746 (85)	total: 461ms	remaining: 34.7ms
93:	test: 0.9622250	best: 0.9636746 (85)	total: 465ms	remaining: 29.7ms
94:	test: 0.9622106	best: 0.9636746 (85)	total: 470ms	remaining: 24.7ms
95:	test: 0.9621857	best: 0.9636746 (85)	total: 474ms	remaining: 19.8ms
96:	test: 0.9622258	best: 0.9636746 (85)	total: 479ms	remaining: 14.8ms
97:	test: 0.9622152	best: 0.9636746 (85)	total: 483ms	remaining: 9.85ms
98:	test: 0.9622289	best: 0.9636746 (85)	total: 487ms	remaining: 4.92ms
99:	test: 0.9623737	best: 0.9636746 (85)	total: 491ms	remaining: 0us

bestTest = 0.9636745675
bestIteration = 85

Shrink model to first 86 iterations.
In [88]:
y_val_raw = catboost.predict(X_val, prediction_type="RawFormulaVal")
In [89]:
plt.title("Validation raw")
sns.histplot(x=y_val_raw, hue=y_val, bins=33)
plt.show()
No description has been provided for this image

As with lightgbm, catboost has a region of raw scores, roughly from -2 to 2, where the model is unsure and makes mistakes. Let's look at which training objects influence this.

In [90]:
undefined_pool = cb.Pool(X_val[np.abs(y_val_raw) < 2], y_val[np.abs(y_val_raw) < 2])
undefined_pool.shape
Out[90]:
(7155, 40)
In [91]:
indeces, scores = catboost.get_object_importance(
    undefined_pool.slice(np.arange(10)),
    tr_pool,
    top_size=5,  # does not affect speed
    type="PerObject",  # does not affect speed
    update_method="SinglePoint",  # strongly affects speed! SinglePoint is the way to go :)
    importance_values_sign="All",  # does not affect speed
    thread_count=32,
)
In [92]:
for i in range(10):
    print(f"index={i}, real_target={y_val[np.abs(y_val_raw) < 2][i]}")
    display(X_tr.iloc[indeces[i]][["page_type_3", "timedelta_3", "timedelta_1"]])
    display(y_tr[indeces[i]])
index=0, real_target=1
page_type_3 timedelta_3 timedelta_1
order_id
1175248 0.0 -595598.777 -0.024
1164495 0.0 -519842.500 -0.330
1253852 2.0 -434375.977 -0.484
1193397 0.0 -526297.223 0.030
1171789 0.0 -568726.220 0.487
array([1, 1, 1, 1, 1])
index=1, real_target=1
page_type_3 timedelta_3 timedelta_1
order_id
1259279 0.0 -320871.673 -0.230
1202542 1.0 -2119.147 -24.567
1281626 0.0 -7150.057 0.720
1234980 0.0 -9649.763 -0.236
1297762 0.0 -14707.633 -0.333
array([0, 1, 1, 0, 0])
index=2, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1302382 5.0 281.060 295.717
1314834 3.0 586.417 1289.560
1328959 10.0 125.847 713.440
1212635 2.0 1668.270 1258.523
1231565 5.0 4366.357 880.744
array([1, 1, 1, 1, 1])
index=3, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1195237 1.0 -2382.353 -42.306
1301008 2.0 -1421.270 -0.130
1211465 1.0 -320.433 -17.820
1334601 0.0 -1170.037 0.750
1236722 1.0 -1861.740 0.117
array([1, 0, 1, 0, 0])
index=4, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1213396 2.0 -72394.113 0.524
1338806 2.0 -413448.743 -15.860
1253422 0.0 -311268.510 0.000
1261851 0.0 -605585.217 0.130
1180464 0.0 -410573.487 -0.294
array([1, 1, 1, 1, 1])
index=5, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1262444 0.0 -519054.693 0.0
1288013 0.0 -358709.213 0.0
1283500 0.0 -436569.980 0.0
1181366 0.0 -389951.583 0.0
1245312 0.0 -518868.227 0.0
array([1, 1, 1, 1, 1])
index=6, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1262921 1.0 -178.763 -23.976
1324028 0.0 -4629.663 -0.200
1264136 2.0 -1497.023 25.687
1278684 1.0 -10831.283 0.694
1302861 0.0 -3170.397 -0.294
array([1, 1, 0, 1, 1])
index=7, real_target=1
page_type_3 timedelta_3 timedelta_1
order_id
1218925 0.0 -6455.033 0.0
1324091 0.0 -12338.623 0.0
1332701 0.0 -74900.727 0.0
1292198 0.0 -12986.190 0.0
1331822 0.0 -28757.787 0.0
array([0, 0, 0, 0, 0])
index=8, real_target=1
page_type_3 timedelta_3 timedelta_1
order_id
1210623 4.0 -357.677 376.873
1292705 4.0 -1739.663 168.614
1276537 2.0 -300.260 411.833
1160067 2.0 -310.787 169.263
1228244 1.0 -2013.380 1515.240
array([0, 0, 1, 0, 0])
index=9, real_target=0
page_type_3 timedelta_3 timedelta_1
order_id
1262444 0.0 -519054.693 0.000
1212600 0.0 -256489.260 0.440
1164070 0.0 -432832.037 0.186
1177533 0.0 -408052.063 0.607
1313622 0.0 -502089.843 0.000
array([1, 1, 1, 1, 1])

Note that for indices (4, 5, 9) the true label is 0, yet their most influential training objects all have label 1 and huge timedelta_3 values that look like outliers. So fixing this might yield a higher score.

In [93]:
timedelta_3_clip_lower = -100000  # ideally this constant would be tuned properly
X_tr["timedelta_3"] = X_tr["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X_val["timedelta_3"] = X_val["timedelta_3"].clip(lower=timedelta_3_clip_lower)
In [94]:
fig, ax = plt.subplots(1, 2, figsize=(18, 6))
sns.histplot(x=X_tr["timedelta_3"], hue=y_tr, bins=33, ax=ax[0])
sns.histplot(x=X_val["timedelta_3"], hue=y_val, bins=33, ax=ax[1])
plt.show()
No description has been provided for this image
In [96]:
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)

lgbm_clf = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc"],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
    [val_dataset],
    ["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = lgbm_clf.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005091 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5760
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.955965
[6]	Validation's auc: 0.957652
[9]	Validation's auc: 0.958019
[12]	Validation's auc: 0.959067
[15]	Validation's auc: 0.960102
[18]	Validation's auc: 0.962364
[21]	Validation's auc: 0.963027
[24]	Validation's auc: 0.963588
[27]	Validation's auc: 0.963997
[30]	Validation's auc: 0.964133
[33]	Validation's auc: 0.964295
[36]	Validation's auc: 0.964474
[39]	Validation's auc: 0.964641
[42]	Validation's auc: 0.964692
[45]	Validation's auc: 0.964785
[48]	Validation's auc: 0.96481
[51]	Validation's auc: 0.964913
[54]	Validation's auc: 0.964931
[57]	Validation's auc: 0.96493
[60]	Validation's auc: 0.965037
[63]	Validation's auc: 0.96514
[66]	Validation's auc: 0.96517
[69]	Validation's auc: 0.965154
[72]	Validation's auc: 0.965185
[75]	Validation's auc: 0.965147
[78]	Validation's auc: 0.965166
[81]	Validation's auc: 0.965195
[84]	Validation's auc: 0.965306
[87]	Validation's auc: 0.965371
[90]	Validation's auc: 0.965368
[93]	Validation's auc: 0.965463
[96]	Validation's auc: 0.965445
[99]	Validation's auc: 0.965438

It got slightly better, but there was no big jump :/

6. SHAP (5 points)¶

In [97]:
lgbm_explainer = shap.TreeExplainer(model)
lgbm_shap_values = lgbm_explainer(X_val)
shap.plots.beeswarm(lgbm_shap_values, max_display=10)
No description has been provided for this image

lgbm effectively uses:

  • page_type_3 and page_type_6 as binary features: "> 0" -> 0 and "== 0" -> 1
  • pageview_duration_sec_last as a missingness flag: "not nan" -> 0 and "nan" -> 1
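
A minimal sketch of the thresholding the beeswarm plot suggests (the frame below is a hypothetical stand-in, not the real features):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "page_type_3": [0.0, 2.0, 0.0, 5.0],
        "page_type_6": [1.0, 0.0, 0.0, 3.0],
        "pageview_duration_sec_last": [12.0, np.nan, 3.0, np.nan],
    }
)

# Collapse the counts into the binary form the model seems to rely on
df["page_type_3_is_zero"] = (df["page_type_3"] == 0).astype(int)
df["page_type_6_is_zero"] = (df["page_type_6"] == 0).astype(int)
# ...and the duration into a pure missingness flag
df["duration_last_is_nan"] = df["pageview_duration_sec_last"].isna().astype(int)
```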
In [98]:
cb_explainer = shap.TreeExplainer(catboost)
cb_shap_values = cb_explainer(X_val)
shap.plots.beeswarm(cb_shap_values, max_display=10)
No description has been provided for this image

For some reason catboost does not use page_type_3 at all. That may be why its score is lower, but no matter what I tried I could not push this feature to first place, so I'll stick with lgbm (those experiments were not saved, so you'll have to take my word for it).

So here, too, I failed to extract any benefit for the task.

Submission¶

I proudly decided not to tune hyperparameters!

In [215]:
X["timedelta_3"] = X["timedelta_3"].clip(lower=timedelta_3_clip_lower)
train_dataset = lgb.Dataset(X, y, categorical_feature=cat_features)

model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc"],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    100,
)

X_tst = transform(tst, web_aggregate)
X_tst["timedelta_3"] = X_tst["timedelta_3"].clip(lower=timedelta_3_clip_lower)

submission = pd.read_csv(SAMPLE_SUBMISSION_PATH, index_col="order_id")
submission["is_callcenter"] = model.predict(X_tst)
submission.to_csv(SUBMISSION_PATH)
submission
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 37099, number of negative: 67496
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001702 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5430
[LightGBM] [Info] Number of data points in the train set: 104595, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.354692 -> initscore=-0.598478
[LightGBM] [Info] Start training from score -0.598478
Out[215]:
is_callcenter
order_id
1350922 0.008653
1354989 0.012926
1352637 0.503872
1350050 0.751182
1341733 0.272487
... ...
1358397 0.231185
1357968 0.016199
1358835 0.990611
1365692 0.114166
1365429 0.007388

17196 rows × 1 columns

For catboost I decided to train a model with only 20 trees and apply the boolean page_type_3 idea, so as not to overfit.

In [219]:
X["timedelta_3"] = X["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X["page_type_3"] = (X["page_type_3"] > 0).astype(int)
tr_pool = cb.Pool(X, y)

catboost = cb.train(
    tr_pool, {"iterations": 20, "eval_metric": "AUC", "loss_function": "Logloss"}
)

X_tst = transform(tst, web_aggregate)
X_tst["timedelta_3"] = X_tst["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X_tst["page_type_3"] = (X_tst["page_type_3"] > 0).astype(int)

submission = pd.read_csv(SAMPLE_SUBMISSION_PATH, index_col="order_id")
submission["is_callcenter"] = catboost.predict(X_tst, prediction_type="Probability")[
    :, 1
]
submission.to_csv(SUBMISSION_PATH)
submission
Learning rate set to 0.5
0:	total: 5.21ms	remaining: 99ms
1:	total: 9.46ms	remaining: 85.2ms
2:	total: 14.1ms	remaining: 79.9ms
3:	total: 18.8ms	remaining: 75.1ms
4:	total: 23ms	remaining: 69ms
5:	total: 27.5ms	remaining: 64.1ms
6:	total: 31.8ms	remaining: 59.1ms
7:	total: 36.5ms	remaining: 54.7ms
8:	total: 40.5ms	remaining: 49.5ms
9:	total: 45.2ms	remaining: 45.2ms
10:	total: 49.4ms	remaining: 40.4ms
11:	total: 53.7ms	remaining: 35.8ms
12:	total: 57.7ms	remaining: 31.1ms
13:	total: 62.1ms	remaining: 26.6ms
14:	total: 66ms	remaining: 22ms
15:	total: 70.6ms	remaining: 17.6ms
16:	total: 74.7ms	remaining: 13.2ms
17:	total: 79.2ms	remaining: 8.79ms
18:	total: 83.3ms	remaining: 4.38ms
19:	total: 87.5ms	remaining: 0us
Out[219]:
is_callcenter
order_id
1350922 0.007815
1354989 0.011382
1352637 0.474446
1350050 0.663885
1341733 0.223427
... ...
1358397 0.244059
1357968 0.010922
1358835 0.971819
1365692 0.169072
1365429 0.005213

17196 rows × 1 columns